Big Data Analysis in Finance

Module 2

Author
Affiliation

Prof. Matthew G. Son

University of South Florida

Big Data

Intro

Big data in business

“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore

“In God we trust, all others bring data.” — W. Edwards Deming

Big Data

What is big data?

  1. Volume: the massive amount (size) of data
  2. Variety (Complexity): the diversity and complexity of data
  3. Velocity: the speed of data generation

Big Data by Volume

A rough definition of big data, by volume:

  • If the data fits in memory: small data

  • If the data is larger than memory but fits on your hard disk: medium data

  • If the data exceeds even a normal disk: big data

How is Big Data Managed?

(Massively) Large datasets are stored in different ways.

  1. Database

  2. Data Warehouse

  3. Data Lake

Database

  • A well-organized file cabinet

  • Structured

  • Usually managed by a single provider

  • Many databases accept SQL

  • e.g., Mergent, MSRB, IvyDB

Data Warehouse

  • A huge library archive
  • Data from various sources
  • e.g., photos, videos, voice recordings, documents (databases)
  • (Semi) structured
  • Optimized for looking up, not for quick updates
  • e.g., WRDS, Google BigQuery, Amazon RedShift, Snowflake

Data lake

  • A data pool, or a dumping site

  • Unstructured, not organized

    • Can be structured / semi-structured within
  • Data from various sources, can be raw

  • Great way to store massive amounts of data, quickly

  • e.g., AWS S3, Google Cloud Storage, Git LFS

Views on Big Data Approach

If your data fits in memory, there’s no advantage to putting it in a database: it will only be slower and more frustrating.

— Hadley Wickham, Chief Scientist @ Posit

Tip
  • Some backend engines challenge this view (e.g., duckplyr)

Challenges in Big Data

Often we don’t have enough computing power to handle big data:

  • Too big to fit in memory

  • The usual in-memory toolbox in R/Python no longer works

  • So many choices to consider

How do we handle Big Data?

Two approaches:

  1. Shrink the data

  2. Scale the machine

Common Big Data Solutions

Big data problems are often described as “small data problems in disguise”, meaning:

Often what we care about is only a subset of the large data.

When data is stored in a well-structured way (e.g., a database),

  • we can bypass loading the whole dataset into memory

  • and read only what is relevant.
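For instance, with an in-process engine like SQLite the filtering happens inside the database engine, so only the matching rows ever reach your session. A minimal Python sketch, using a hypothetical bonds table with made-up values:

```python
import sqlite3

# An in-memory database stands in for a large on-disk one;
# the table and its contents are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bonds (cusip TEXT, rating TEXT, amount REAL)")
con.executemany(
    "INSERT INTO bonds VALUES (?, ?, ?)",
    [("00001AAA", "AAA", 500.0),
     ("00002BBB", "BB", 120.0),
     ("00003CCC", "AAA", 300.0)],
)

# The WHERE clause is evaluated by the engine; only the matching
# subset is loaded into our program's memory.
rows = con.execute(
    "SELECT cusip, amount FROM bonds WHERE rating = 'AAA'"
).fetchall()
print(rows)  # [('00001AAA', 500.0), ('00003CCC', 300.0)]
con.close()
```

The same pattern applies to a remote client-server database: the heavy lifting stays on the server, and the query result is all that travels to you.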

Approach 1: Shrink the data

  • Database querying

    • Process heavy lifting on the server
    • Or out-of-core (disk)
    • Retrieve only what you need
  • Chunk processing

    • Read, process, write little by little
  • Downsampling

    • Random sampling
    • Temporal
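The chunk-processing and downsampling ideas above can be sketched with only the Python standard library (the in-memory CSV below is a made-up stand-in for a large file on disk):

```python
import csv
import io
import random

# Hypothetical "large" CSV of 1,000 prices; a StringIO stands in
# for a file too big to load at once.
raw = io.StringIO("price\n" + "\n".join(str(p) for p in range(1000)))

# Chunk processing: read, aggregate, and discard a block of rows at a
# time, so memory use is bounded by the chunk size, not the file size.
reader = csv.DictReader(raw)
chunk, chunk_size, total, n = [], 100, 0.0, 0
for row in reader:
    chunk.append(float(row["price"]))
    if len(chunk) == chunk_size:
        total += sum(chunk)  # process the chunk...
        n += len(chunk)
        chunk = []           # ...then free it
total += sum(chunk)          # leftover partial chunk
n += len(chunk)
mean_price = total / n

# Downsampling: a random 10% of row indices is often enough
# for exploratory work.
random.seed(42)
sample = random.sample(range(n), k=n // 10)
print(mean_price, len(sample))
```

Real workflows would use a chunked reader from pandas or arrow instead of hand-rolled loops, but the memory-bounding logic is the same.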

Approach 2: Scaling up

  • Cloud computing (Virtual Machines / Containers)

  • High performance clusters (HPC)

Cloud Computing

  • On-Demand: Resources provisioned dynamically, as needed
  • Scalability: Easily scale up or down
  • Flexibility: Ideal for diverse applications
  • Managed Services: Provided and maintained by commercial vendors
  • Pricing:
    • Pay-as-you-go
    • Economical for short runs, expensive for long-running jobs

High Performance Computing (HPC)

  • Dedicated: Resources reserved for a job’s duration
  • Optimized for Performance: Designed for tightly coupled, parallel processing tasks
  • Research-Centric: For intensive simulations and analyses
  • Pricing:
    • Fixed upfront / subscriptions
    • Often supported by academic or research institutions

Cloud Computing

Well-known cloud computing providers:

  1. AWS (Amazon Web Services)
  2. Microsoft Azure
  3. Google Cloud Platform (GCP)
  4. Oracle Cloud

HPC at USF

CIRCE

  • Managed by Research Computing department

  • Two main clusters: CIRCE and Secure Cluster for sensitive data

  • Access through JIRA or email request

USF Research Computing Documentation

Database

A database is a collection of data that is structured and organized.

  • Databases are often provided remotely but can also be stored locally.

Database in a nutshell:

A filing cabinet arranges items in alphabetical order:

files starting “ABC” in the top drawer, “DEF” in the second drawer, etc.

To find Alice’s file, you’d only have to search the top drawer.

For Fred, the second drawer, and so on.
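In a real database, the role of the alphabetized drawers is played by an index. A minimal SQLite sketch in Python (the table and values are hypothetical):

```python
import sqlite3

# Hypothetical "clients" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clients (name TEXT, balance REAL)")
con.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [("Alice", 100.0), ("Bob", 250.0), ("Fred", 75.0)],
)

# The index is the "alphabetized drawer": it lets the engine jump
# straight to the right entries instead of scanning every row.
con.execute("CREATE INDEX idx_name ON clients(name)")

row = con.execute(
    "SELECT balance FROM clients WHERE name = 'Alice'"
).fetchone()
print(row)  # (100.0,)
con.close()
```

On three rows the index is irrelevant, but on millions of rows the difference between an indexed lookup and a full scan is dramatic.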

Types of database

  1. Relational database (RDBMS)

    • Traditional, often called SQL databases

    • Data stored in tabular form

    • Optimal for data that does not change often

    • Used when accuracy / consistency is crucial (e.g., financial data)

    • e.g., PostgreSQL, MySQL

  2. Non-relational database (NoSQL)

    • Data stored in formatted text form (e.g., JSON)

    • For organizing complex, diverse, and frequently changing data

    • e.g., MongoDB
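The contrast can be sketched in Python: a relational row has a fixed set of columns, while a NoSQL document is formatted text (here JSON) whose fields can vary from record to record. All names and values below are made up for illustration:

```python
import json

# Relational view: a fixed schema, every record has the same columns.
# (cusip, issuer, coupon) -- hypothetical values.
row = ("000000AA0", "Acme Corp", 2.4)

# Document view: each record is self-describing JSON, so one bond can
# carry a nested "callable" block that other bonds simply omit.
doc = json.loads("""
{
  "cusip": "000000AA0",
  "issuer": "Acme Corp",
  "coupon": 2.4,
  "callable": {"first_call_date": "2026-05-01"}
}
""")
print(doc["issuer"], doc["callable"]["first_call_date"])
```

The flexibility of the document form is exactly why NoSQL suits changing data, and the rigidity of the row form is why RDBMSs suit data where consistency matters.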

Relational Databases (RDBMS)

Source: https://xbsoftware.com/blog/main-types-of-database-management-systems/

Mergent FISD Example

Mergent FISD (Fixed Income Securities Database), shown here via two of its tables:

  • Issue table

  • Issuer table

DBMS computational forms

  • Client-server DBMS: run on a server within an organization

    • WRDS
  • Cloud DBMS: Similar to client-server DBMS, but on cloud

    • Snowflake, Google BigQuery, Amazon RedShift
  • In-process DBMS: run entirely on your computer

    • DuckDB, SQLite

Suggested Reading

  1. Wickham et al., “R for Data Science”, 2nd ed.
    1. Ch. 22: Databases